
Introduce atomic slot migration #1591

Closed · wants to merge 18 commits

Conversation

murphyjacob4 (Contributor) commented:

Starting this PR in draft so we can review early and gather alignment.

Summary

The solution is mostly based on the discussion and proposals in #23:

  • Target driven: the user sends CLUSTER IMPORT to the target and it syncs down the slot
  • Uses AOF format for snapshot, proxying all commands directly from source primary through the target primary to all target replicas
  • Consensus-less epoch bump to broadcast the topology change

Import and Export Workflows

We utilize a new internal CLUSTER SYNCSLOTS command with additional sub-commands to transition through the slot migration state machine.

On either end of the migration, we track ongoing migrations via two job queues: slot_import_jobs and slot_export_jobs. Right now, we only support one concurrent job in each queue. There is no fundamental design restriction here (outside of perhaps some protocol additions to CLUSTER SYNCSLOTS), but the limit keeps the first iteration simple.

Workflow overview

  1. User sends CLUSTER IMPORT SLOTSRANGE <slot start> <slot end> ... to target node T
  2. T initiates a new connection to source node S
  3. If required, T runs AUTH based on replication configuration
  4. T initiates CLUSTER SYNCSLOTS START to S
  5. S begins tracking the client that sent the command as the slot export client. It spawns a child process at the next available time and runs AOF rewrite with just the specified slots. It then begins accumulating a backlog of writes in the slot export client output buffer, without installing the write handler.
  6. At the end of the AOF rewrite, S also appends CLUSTER SYNCSLOTS ENDSNAPSHOT
  7. T processes the AOF rewrite as it would any other client using readQueryFromClient. Once it gets the CLUSTER SYNCSLOTS ENDSNAPSHOT, T sends back a CLUSTER SYNCSLOTS PAUSE to pause S.
  8. Upon receiving the command, S unblocks the slot export client to T, which has been accumulating ongoing writes. S then pauses itself and sends CLUSTER SYNCSLOTS PAUSEOFFSET <offset> back to T with the current offset. (Note that this is not the primary replication offset; it is a computed offset based on how much data has been accumulated in the export client's output buffer for T.)
  9. T waits for its replication offset to catch up to the sent offset and, once caught up, executes the consensus-less epoch bump.
  10. S finds out about the bump via cluster gossip, unpauses itself, and cleans up dirty keys.
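The target-side flow above is essentially a small state machine driven by protocol events. The sketch below is a minimal model of it; the state and event names are hypothetical illustrations, not identifiers from the PR:

```c
#include <assert.h>

/* Hypothetical target-side import states, mirroring steps 2-9 above.
 * The PR's actual state machine may differ. */
typedef enum {
    IMPORT_CONNECTING,    /* steps 2-3: connect to S, optional AUTH */
    IMPORT_SNAPSHOTTING,  /* steps 4-7: SYNCSLOTS START, consume AOF snapshot */
    IMPORT_PAUSED,        /* step 8: sent PAUSE, awaiting PAUSEOFFSET */
    IMPORT_CATCHING_UP,   /* step 9: draining stream up to the pause offset */
    IMPORT_DONE,          /* step 9: epoch bumped, slots owned */
    IMPORT_FAILED         /* disconnect or timeout at any point */
} import_state;

typedef enum {
    EV_CONNECTED, EV_ENDSNAPSHOT, EV_PAUSEOFFSET, EV_CAUGHT_UP, EV_ERROR
} import_event;

/* Advance the state machine on one protocol event; any unexpected
 * event (or an explicit error) fails the migration. */
import_state import_step(import_state s, import_event e) {
    if (e == EV_ERROR) return IMPORT_FAILED;
    switch (s) {
    case IMPORT_CONNECTING:   return e == EV_CONNECTED   ? IMPORT_SNAPSHOTTING : IMPORT_FAILED;
    case IMPORT_SNAPSHOTTING: return e == EV_ENDSNAPSHOT ? IMPORT_PAUSED       : IMPORT_FAILED;
    case IMPORT_PAUSED:       return e == EV_PAUSEOFFSET ? IMPORT_CATCHING_UP  : IMPORT_FAILED;
    case IMPORT_CATCHING_UP:  return e == EV_CAUGHT_UP   ? IMPORT_DONE         : IMPORT_FAILED;
    default:                  return IMPORT_FAILED;
    }
}
```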

If at any point a client is disconnected on either end, or a timeout is reached on the target node, the migration is marked as failed. If a migration fails, we delete all keys in the slots we were migrating.

Filtering traffic

We filter the traffic to the target node through a filtered AOF rewrite and a filtered replication stream. The filtered AOF rewrite requires some refactoring of the snapshot code for reuse, but utilizes the same overall procedure (piping through the parent process to the target node connection).

The filtered replication stream hooks into the existing replication code and appends the commands directly to the client output buffer. We don't use replication backlog as there is no easy way to filter it once added to the backlog without re-processing it to query for the slot number of each command.

We add a new check in putClientInPendingWriteQueue to prevent the two command streams from merging. The parent process will just accumulate the replication stream in the client output buffer until we get the notification that the target is done with the snapshot.
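The gating described here can be modeled as below. This is a hedged sketch under assumed names (client_sketch and its fields are illustrative, not the PR's actual structures); the point is only that the export client keeps accumulating output while the snapshot child runs, and never enters the pending-write queue until the snapshot is done:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the putClientInPendingWriteQueue check: while the
 * AOF-rewrite child is still streaming the snapshot, the slot export
 * client buffers replicated commands but stays out of the pending-write
 * queue, so the snapshot and the live stream never interleave. */
typedef struct {
    bool is_slot_export_client;
    bool snapshot_in_progress;  /* set while the filtered AOF child runs */
    bool in_pending_writes;
} client_sketch;

static bool may_install_write_handler(const client_sketch *c) {
    if (c->is_slot_export_client && c->snapshot_in_progress) return false;
    return true;
}

void put_client_in_pending_write_queue(client_sketch *c) {
    if (!may_install_write_handler(c)) return;  /* keep accumulating only */
    c->in_pending_writes = true;
}
```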

Other notes

  • Since many checks apply to both slot migration and replication, I added a new client flag, "replicated", which simply means that a client's stream is being replicated from another node. In many places, we now check the replicated flag instead of the primary flag.
  • To support non-contiguous slot ranges, the APIs and code use a slot bitmap to share information about what slots are being migrated.
  • We already delete unowned slots after loading RDB and AOF. Replicas that are promoted to primary now go through the same steps to delete all unowned keys. This prevents the keys from being leaked forever if the previous primary disappears.
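A 16384-slot bitmap fits in 2 KiB, so even non-contiguous SLOTSRANGE pairs can be shared as a single blob. A minimal sketch of such helpers (illustrative names, not the PR's actual API):

```c
#include <assert.h>
#include <string.h>

#define CLUSTER_SLOTS 16384  /* bitmap size: CLUSTER_SLOTS / 8 = 2048 bytes */

/* Mark a single slot in the bitmap. */
static void slot_bitmap_set(unsigned char *bitmap, int slot) {
    bitmap[slot >> 3] |= 1 << (slot & 7);
}

/* Test whether a slot is marked. */
int slot_bitmap_test(const unsigned char *bitmap, int slot) {
    return (bitmap[slot >> 3] >> (slot & 7)) & 1;
}

/* Add one inclusive SLOTSRANGE pair; call repeatedly for
 * non-contiguous ranges. */
void slot_bitmap_add_range(unsigned char *bitmap, int start, int end) {
    for (int s = start; s <= end && s < CLUSTER_SLOTS; s++)
        slot_bitmap_set(bitmap, s);
}
```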

Remaining work items

  • Continue cleaning up code, especially around the state machines
  • TCL tests
  • Metrics
  • Improved logging
  • Improved pause logic - where we only pause the slots being moved
  • Hide the in-progress import better for commands like KEYS and RANDOMKEY
  • Other management APIs (e.g. CLUSTER IMPORT STATUS, CLUSTER IMPORT CANCEL)
  • Incorporate dual-channel-style replication

Signed-off-by: Jacob Murphy <[email protected]>
codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 27.00730% with 400 lines in your changes missing coverage. Please review.

Project coverage is 70.58%. Comparing base (2a1a65b) to head (f1d824f).
Report is 12 commits behind head on unstable.

Files with missing lines   Patch %   Missing lines
src/cluster_legacy.c       10.73%    374
src/kvstore.c              43.75%    9
src/rdb.c                  79.06%    9
src/aof.c                  50.00%    5
src/io_threads.c           0.00%     1
src/module.c               0.00%     1
src/networking.c           96.77%    1
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1591      +/-   ##
============================================
- Coverage     70.78%   70.58%   -0.21%     
============================================
  Files           120      121       +1     
  Lines         65046    65595     +549     
============================================
+ Hits          46045    46302     +257     
- Misses        19001    19293     +292     
Files with missing lines Coverage Δ
src/blocked.c 91.38% <100.00%> (ø)
src/cluster.c 89.21% <100.00%> (ø)
src/commands.def 100.00% <ø> (ø)
src/db.c 89.57% <100.00%> (+<0.01%) ⬆️
src/object.c 82.05% <100.00%> (+0.01%) ⬆️
src/replication.c 87.63% <100.00%> (+0.18%) ⬆️
src/server.c 87.63% <100.00%> (+0.02%) ⬆️
src/server.h 100.00% <ø> (ø)
src/io_threads.c 6.94% <0.00%> (ø)
src/module.c 9.59% <0.00%> (-0.01%) ⬇️
... and 5 more

... and 14 files with indirect coverage changes

hwware (Member) commented Jan 20, 2025:

Do you have plans to implement the following two cases?

  1. CLUSTER IMPORT CANCEL
  2. Source Primary failover during slot migration

murphyjacob4 (Contributor, Author) replied:

  1. CLUSTER IMPORT CANCEL

All this needs to do is call freeClient on the slot migration source client and delete all keys in the slot bitmap of that migration. The source tracks the slot migration, and when the client close notification comes in, it frees its local tracking information. This is the same process that already occurs when a migration times out.

  2. Source Primary failover during slot migration

The implementation plan would be:

  1. Where we currently check the timeout on the target side, also check that the source is still primary, and if not cancel the operation (as discussed above)
  2. On the source side, we would introduce a similar check: if we find out we have been demoted, cancel the operation and close the slot migration link. Right now the source will just give up if that happens; if we add retries in the future, subsequent attempts to run CLUSTER SYNCSLOTS will fail with an error because the node is no longer primary, which we would detect and use to fail the operation.
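The target-side check described in point 1 could be sketched as a periodic cron-style predicate. Everything here (import_job and its fields) is a hypothetical illustration of the plan, not code from the PR:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical target-side periodic check: abort the import if it has
 * exceeded its timeout or if the source node is no longer primary
 * (learned via cluster gossip). On abort, the caller would free the
 * migration client and delete all keys in the migrating slots. */
typedef struct {
    time_t deadline;         /* wall-clock deadline for the migration */
    bool source_is_primary;  /* refreshed from cluster gossip */
} import_job;

bool import_should_abort(const import_job *job, time_t now) {
    if (now > job->deadline) return true;      /* timed out */
    if (!job->source_is_primary) return true;  /* source failed over */
    return false;
}
```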

Note that, due to the consensus-less epoch bump, if there is a race between a failover and a slot migration, both may succeed; one of them will later win deterministically via the epoch collision protocol, so some writes will be lost on the losing side during the window in which two nodes declared primaryship. This is the same issue that exists in the current consensus-less slot migration implementation, and we are looking to address it as part of #1355 instead.

murphyjacob4 (Contributor, Author) commented:

Hi folks, so @enjoy-binbin, @PingXie, and I discussed offline. Tencent has offered a solution that they have developed internally. Moving forward, @enjoy-binbin and I will join efforts and work on bridging gaps in the Tencent solution to meet the requirements we outlined in #23. Hopefully we will have a shared PR for review soon.

Given this, I am going to go ahead and close this pull request.
